- Create a PDF document and a website to communicate your own analysis
- Including your own text, analysis, table and chart
- Using data from an external source (a CSV file)
- Understand the data science workflow in R
- Gain confidence in R
Use RStudio for everything. R is the engine; RStudio is the interface. If you do half your data cleaning in Excel, there will be no record of it and we won't be able to fix mistakes.
Our work is in the form of a 'Recipe Book': A step-by-step guide to the data inputs (the ingredients), our analysis (the cooking instructions) and the outputs (the picture of the perfect meal).
Use R Markdown (.Rmd) files to combine analysis and outputs: R Markdown files allow us to combine data processing in R, in clearly defined 'chunks', with the display of text, tables, charts and maps.
Output is produced only when we press 'Knit': It is NOT an interactive playground like Excel (though you can run individual lines interactively with Ctrl+Enter).
Our work should be self-explanatory and reproducible: Anybody with R should be able to open our work, press 'Knit' and produce the same outputs.
Organize your work in Projects in R: For each major analysis, it's best to choose 'File' -> 'New Project' -> 'New Directory' in RStudio. Save all your data inputs and outputs in this folder (which RStudio will do automatically).
Data frames (tables) are the main building block of our analysis: We focus on manipulating and visualizing tables of data, as these are the best way of organizing our data.
Use meaningful names in your work: 'data_v1b_2_061215' won't mean anything in 3 months! All files and objects should reflect their role in the analysis.
Process our data in a 'tidy' way: This means we will use a set of compatible 'packages' called the 'tidyverse' to make our analysis transparent and avoid common problems.
$$ a^2 + b^2 = c^2 $$
new_object <- old_object
data_frame %>% action_on_dataframe
# Comments go here and won't be processed by R
install.packages("New_package") ONCE, then
library("New_package") at the start of each document
answer <- 2 + 2
answer
## [1] 4
inputs <- seq(0,1,0.2)
answer <- inputs*10
answer
## [1] 0 2 4 6 8 10
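The second example works because arithmetic in R is vectorized: an operation on a vector is applied to every element. The following is a minimal base-R sketch of that idea; the object names `scaled` and `combined` are illustrative and not part of the original material.

```r
# Build the same input vector as above: 0, 0.2, ..., 1
inputs <- seq(0, 1, 0.2)

# Multiplying by a single number scales every element
scaled <- inputs * 10

# Two vectors of the same length are combined element by element
combined <- inputs + scaled

# Compare with a tolerance, because decimal fractions are stored approximately
all.equal(scaled, c(0, 2, 4, 6, 8, 10))
```

Using `all.equal()` rather than `==` is a deliberate choice here: values such as 0.6 cannot be stored exactly, so an exact comparison can fail even when the arithmetic is right.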
Set the title, date, author etc. in the YAML header at the top of the R Markdown file
Functions used at each stage of the workflow, in order:
- read_csv
- filter, select, mutate
- left_join
- mutate, summarize
- zelig
- kable, stargazer
- ggplot
- leaflet, mapview
read_csv
data <- read_csv("data.csv")
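library(foreign)
# SPSS file: to.data.frame=TRUE returns a data frame instead of a list
data <- read.spss("data.sav", to.data.frame=TRUE)
# Stata file
data <- read.dta("data.dta")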
flights %>%
  group_by(carrier) %>%
  # na.rm=TRUE ignores missing delays, as in the later examples
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  kable()
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  # after summarize() only carrier and dep_delay remain, so plot by carrier
  ggplot() + geom_point(aes(x=carrier,y=dep_delay))
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  zelig(dep_delay ~ carrier,data=.,model="ls") %>%
  stargazer(digits=3)
- select specific variables (columns)
- slice observations (rows)
- filter observations (rows) by conditions (based on values in columns)
- count number of observations
- rename variables (columns)
- arrange table in the order of a particular variable
- mutate (change) values of an existing variable, or create a new variable
- summarize data by creating statistics
- round values to a specific number of decimal places
flights %>% select(carrier,origin,air_time,distance,dep_delay)
## # A tibble: 5 x 1
##   air_time
##      <dbl>
## 1      227
## 2      227
## 3      160
## 4      183
## 5      116
flights %>% slice(1:2)
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR         227     1400         2
## 2 UA      LGA         227     1416         4
flights %>% filter(origin=="JFK")
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 AA      JFK         160     1089         2
## 2 B6      JFK         183     1576        -1
flights %>% mutate(air_time=round(air_time/60,3))
## # A tibble: 5 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR       3.783     1400         2
## 2 UA      LGA       3.783     1416         4
## 3 AA      JFK       2.667     1089         2
## 4 B6      JFK       3.050     1576        -1
## 5 DL      LGA       1.933      762        -6
flights %>% mutate(speed=round(distance/air_time,3))
## # A tibble: 5 x 6
##   carrier origin air_time distance dep_delay speed
##   <chr>   <chr>     <dbl>    <dbl>     <dbl> <dbl>
## 1 UA      EWR         227     1400         2 6.167
## 2 UA      LGA         227     1416         4 6.238
## 3 AA      JFK         160     1089         2 6.806
## 4 B6      JFK         183     1576        -1 8.612
## 5 DL      LGA         116      762        -6 6.569
flights %>% summarize(avg_distance=mean(distance,na.rm=TRUE))
## # A tibble: 1 x 1
##   avg_distance
##          <dbl>
## 1       1248.6
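Three of the actions listed earlier (count, rename and arrange) are not shown in the examples above. The sketch below tries them on a hand-built copy of the five rows that appear in the example output; `flights_sample` and `departure_delay` are illustrative names, not objects from the original material.

```r
library(dplyr)

# The five rows shown in the example output, rebuilt by hand
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# count: number of observations for each origin airport (stored in column n)
by_origin <- flights_sample %>% count(origin)

# rename: the pattern is new_name = old_name
renamed <- flights_sample %>% rename(departure_delay = dep_delay)

# arrange: sort the rows by distance, shortest first
sorted <- flights_sample %>% arrange(distance)
```

To sort from longest to shortest instead, wrap the variable in `desc()`, as in `arrange(desc(distance))`.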
These actions can be 'piped' together: We want to find the average speed of United (UA) flights.
In steps: Take the data, filter the data to carrier UA, calculate the speed of each flight, and then find the average.
flights %>%
  filter(carrier=="UA") %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1) %>%
  as.numeric()
## [1] 420.9
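To make the piping explicit, the sketch below runs the same three steps twice: once saving each intermediate table, and once as a single pipe. It uses a hand-built copy of the five rows shown in the earlier output, so the result is not the 420.9 reported above; `flights_sample`, `ua_only`, `ua_speed`, `stepwise` and `piped` are illustrative names.

```r
library(dplyr)

# The five rows shown in the example output, rebuilt by hand
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Step by step, saving each intermediate table
ua_only  <- filter(flights_sample, carrier == "UA")
ua_speed <- mutate(ua_only, speed = distance / (air_time / 60))
stepwise <- summarize(ua_speed, avg_speed = mean(speed, na.rm = TRUE))

# The same steps piped together: x %>% f(y) is read as f(x, y)
piped <- flights_sample %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# Both routes give the same average
all.equal(stepwise$avg_speed, piped$avg_speed)
```

The piped version avoids naming throwaway objects, which fits the advice above to keep only meaningfully named objects in the analysis.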
avg_speed <- flights %>%
  filter(carrier=="UA") %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)
The average speed of United Flights is `r avg_speed` miles per hour.
The average speed of United Flights is 420.9 miles per hour.
Can we change the order of data processing?
In steps: Take the data, calculate the speed of each flight, filter the data to carrier UA, and then find the average.
flights %>%
  mutate(speed=distance/(air_time/60)) %>%
  filter(carrier=="UA") %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)
420.9
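Swapping filter and mutate leaves the answer unchanged because mutate works row by row: each flight's speed is the same whether or not the other carriers are still in the table. The sketch below checks this on a hand-built copy of the five rows shown in the earlier output (so the number itself differs from 420.9); `flights_sample`, `filter_first` and `mutate_first` are illustrative names.

```r
library(dplyr)

# The five rows shown in the example output, rebuilt by hand
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Order 1: filter to UA, then calculate speed
filter_first <- flights_sample %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# Order 2: calculate speed for every flight, then filter to UA
mutate_first <- flights_sample %>%
  mutate(speed = distance / (air_time / 60)) %>%
  filter(carrier == "UA") %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# The two orders agree
all.equal(filter_first$avg_speed, mutate_first$avg_speed)
```

Filtering first means speed is only calculated for the rows that are kept, which can save time on a large table.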
Can we change the order of data processing?
In steps: Take the data, calculate the speed of each flight, find the average, and then filter the data to carrier UA.
flights %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  # summarize() has already averaged over all carriers and kept only avg_speed,
  # so filtering by carrier at this point comes too late
  filter(carrier=="UA") %>%
  round(1)
394.3
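This order gives a different figure because summarize, used on its own, collapses the whole table into a single row and keeps only the new summary column. The sketch below shows this on a hand-built copy of the five rows from the earlier output, and then one possible way to summarize before filtering by adding group_by. `flights_sample`, `overall` and `by_carrier` are illustrative names, and the group_by variant is a suggestion rather than something taken from the original material.

```r
library(dplyr)

# The five rows shown in the example output, rebuilt by hand
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Summarizing without groups: one row, and carrier is gone
overall <- flights_sample %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))
names(overall)

# Summarizing by carrier keeps one row per carrier, so filtering still works
by_carrier <- flights_sample %>%
  mutate(speed = distance / (air_time / 60)) %>%
  group_by(carrier) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE)) %>%
  filter(carrier == "UA")
```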
flights %>% slice(1:5) %>% select(carrier,origin,air_time,distance,dep_delay) %>% kable()
| carrier | origin | air_time | distance | dep_delay |
|---|---|---|---|---|
| UA | EWR | 227 | 1400 | 2 |
| UA | LGA | 227 | 1416 | 4 |
| AA | JFK | 160 | 1089 | 2 |
| B6 | JFK | 183 | 1576 | -1 |
| DL | LGA | 116 | 762 | -6 |
flights %>% slice(1:5) %>% select(carrier,origin,air_time,distance,dep_delay) %>% kable(caption="Example Table", align="lcccc")
| carrier | origin | air_time | distance | dep_delay |
|:---|:---:|:---:|:---:|:---:|
| UA | EWR | 227 | 1400 | 2 |
| UA | LGA | 227 | 1416 | 4 |
| AA | JFK | 160 | 1089 | 2 |
| B6 | JFK | 183 | 1576 | -1 |
| DL | LGA | 116 | 762 | -6 |
flights %>% filter(carrier=="UA") %>% ggplot() + geom_point(aes(x=dep_time,y=dep_delay))
flights %>% filter(carrier=="UA") %>% ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) + geom_smooth(aes(x=dep_time,y=dep_delay))
flights %>%
  filter(carrier=="UA") %>%
  ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) +
  geom_smooth(aes(x=dep_time,y=dep_delay)) +
  ggtitle("Example Chart") +
  xlab("Departure Time") +
  ylab("Departure Delay")
flights %>% ggplot() + geom_bar(aes(x=dep_delay))
flights %>% ggplot() + geom_bar(aes(x=dep_delay)) + xlim(-30,100)
flights %>% group_by(origin) %>% summarize(avg_delay=mean(dep_delay,na.rm=TRUE)) %>% ggplot() + geom_col(aes(x=origin, y=avg_delay))